Systematic literature review

Vincent Bagilet (Columbia University, https://www.sipa.columbia.edu/experience-sipa/sipa-profiles/vincent-bagilet), Léo Zabrocki (Paris School of Economics, https://www.parisschoolofeconomics.eu/en/)
2020-12-14

Purpose of the document

In the present document, we conduct a systematic review of the literature on the short-term health effects of air pollution. The objective is twofold:

- Retrieve effect sizes and confidence intervals in order to compute power, type M and type S errors in the literature.
- Get a sense of the proportion of papers in this literature discussing power and missing data issues.

Motivation

In this section, we discuss the importance of such an analysis. (Yet to be written)

Power analysis

In this section, we implement robustness tests to compute the power, type M and type S errors of the studied articles. We ask what the power, type M and type S errors would be if the true effect was only a fraction of the measured effect. We retrieved the estimates and confidence intervals of the articles in the literature of interest in another document. Before turning to the power analysis itself, we examine the characteristics of the articles considered.

Articles characteristics

Full set of articles

We retrieved the articles using the following query:

‘TITLE((“air pollution” OR “air quality” OR “particulate matter” OR ozone OR “nitrogen dioxide” OR “sulfur dioxide”) AND (“emergency” OR “mortality”) AND NOT (“long term” or “long-term”)) AND (“particulate matter” OR ozone OR “nitrogen dioxide” OR “sulfur dioxide”)’

This query returns 1649 articles. Based on the abstracts, we can briefly explore the main (unsurprising) themes of the articles:

Abstracts with effects and confidence intervals

Out of all the articles returned by the query, 700 mention confidence intervals in their abstract (“CI”, “confidence interval”, etc.):

In these articles, we retrieve valid effects and confidence intervals in the following proportions:1

Table 1: Number of articles for which at least one effect is retrieved (out of those containing the phrase ‘CI’)

Effect retrieved   Number of articles   Proportion
Yes                592                  0.8457143
No                 108                  0.1542857

This corresponds to 1858 valid effects and associated confidence intervals.

Here is a random example of the effects and confidence intervals detected by our method:
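As an illustration of this detection step, a simplified version could look as follows in R. The regex, the extract_effects() helper and the sample abstract below are illustrative sketches assuming the common “estimate (95% CI: lower, upper)” layout; they are not our exact extraction rules.

```r
library(dplyr)
library(stringr)

# Illustrative pattern: a number followed by "(95% CI: lower, upper)".
# Real abstracts are messier; this sketch only captures the most
# common layout.
ci_pattern <- "(\\d+\\.?\\d*)\\s*\\(95%\\s*CI[:,]?\\s*(-?\\d+\\.?\\d*)[,-]\\s*(-?\\d+\\.?\\d*)\\)"

extract_effects <- function(abstract) {
  matches <- str_match_all(abstract, ci_pattern)[[1]]
  tibble(
    estimate = as.numeric(matches[, 2]),
    ci_lower = as.numeric(matches[, 3]),
    ci_upper = as.numeric(matches[, 4])
  )
}

extract_effects(
  "Each 10 ug/m3 increase in PM2.5 was associated with a relative
   risk of 1.04 (95% CI: 1.01, 1.08) for all-cause mortality."
)
#> estimate = 1.04, ci_lower = 1.01, ci_upper = 1.08
```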

Comparison between abstracts with and without detected effects

In this subsection, we investigate whether there are systematic differences between articles whose abstracts display an effect that we detected and articles that either do not display an effect or for which we failed to detect it.

We first wonder whether there are disparities in publication dates. It might be the case that displaying effects in the abstract was a feature of a given period.

Even though there are slightly more recent (2010-2020) articles for which effects are retrieved, the difference does not seem to be substantial.

We also investigate whether there are differences in the journals in which the articles are published.

For this analysis to be informative, we would need to cluster the journals into groups (e.g., epidemiology journals, general science journals, etc.).

Then, we wonder whether the themes considered in each type of abstract differ.

Apart from a few key terms, such as “CI” and “95”, there are no large variations in the themes.

We do not seem to detect effects more often for one pollutant than for another. Note that if an article considers several pollutants, it appears several times in this graph.

Now that we have quickly compared the articles for which we retrieve an effect and those for which we do not, we can dig further into the analysis of the estimates retrieved.

Analysis of the effects

In this section, we briefly analyse the effects retrieved. First, we look into the proportion of effects which are significant.

Significant   Number of effects   Proportion
No            88                  0.0473628
Yes           1770                0.9526372

Unsurprisingly, most of the effects retrieved here are significant: these are effects that were reported in the abstracts along with confidence intervals.

We then look into the distribution of the t-scores.

We notice some bunching of t-scores just above 1.96. We might need to investigate further whether this is evidence of publication bias. Yet, our analysis itself might be biased to some extent since we only consider estimates from the abstracts: authors may choose to report only statistically significant estimates in the abstract, even when the body of the article also contains non-significant ones. We could investigate this further by reproducing the present analysis on the full texts rather than on the abstracts alone.

We then plot the distribution of the signal-to-noise ratio, i.e., the ratio of the point estimate to the width of the confidence interval.
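Concretely, assuming normal-based 95% confidence intervals, the standard error, t-score and signal-to-noise ratio can all be recovered from the point estimate and the interval bounds. A minimal sketch, with made-up estimates expressed as percent increases (so that the null effect is 0):

```r
library(dplyr)

# Toy data standing in for the estimates retrieved from the abstracts.
effects <- tibble(
  estimate = c(1.8, 0.6, 2.4),
  ci_lower = c(0.4, -0.3, 0.1),
  ci_upper = c(3.2, 1.5, 4.7)
)

effects <- effects %>%
  mutate(
    # Under a normal approximation, a 95% CI spans 2 * 1.96 standard errors.
    se      = (ci_upper - ci_lower) / (2 * qnorm(0.975)),
    t_score = estimate / se,
    # Signal-to-noise ratio as defined above: estimate over CI width.
    snr     = estimate / (ci_upper - ci_lower)
  )
```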

The graph is of course analogous to the previous one. It nonetheless shows that, in a large share of the studies, the magnitude of the noise is larger than the magnitude of the effect. Looking in more detail at the distribution of the signal-to-noise ratio, we notice that for 40% of the estimates considered here, the magnitude of the noise exceeds that of the signal.

Quantile   Signal-to-noise ratio
0%           0.0322581
10%          0.5384615
20%          0.6607381
30%          0.8277522
40%          1.0289348
50%          1.3533073
60%          2.2766337
70%          4.7265455
80%         10.0446281
90%         24.8030488
100%       834.8333333

Power analysis

We then turn to the power analysis itself. To do so, we use the retrodesign package, which computes retrospective design calculations (power, type M and type S errors). We run retro_design for several effect sizes.
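For reference, the quantities computed by retrodesign follow the design calculation of Gelman and Carlin (2014). A minimal sketch of that calculation (our own simplified re-implementation, not the package’s exact code), where A is the hypothesized true effect and s the standard error of the estimate:

```r
# Sketch of the Gelman and Carlin (2014) design calculation.
retro_design_sketch <- function(A, s, alpha = 0.05, n_sims = 10000) {
  z <- qnorm(1 - alpha / 2)
  # Probability that the estimate is significant, in either direction.
  p_hi  <- 1 - pnorm(z - A / s)
  p_lo  <- pnorm(-z - A / s)
  power <- p_hi + p_lo
  # Type S error: probability that a significant estimate has the wrong sign.
  type_s <- p_lo / power
  # Type M error: expected exaggeration ratio among significant estimates.
  estimates   <- A + s * rnorm(n_sims)
  significant <- abs(estimates) > s * z
  type_m <- mean(abs(estimates)[significant]) / A
  list(power = power, type_s = type_s, type_m = type_m)
}

# Example: design calculation if the true effect were half of a
# measured effect of 2 with standard error 1.
retro_design_sketch(A = 0.5 * 2, s = 1)
```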

Overall analysis

In a first part, we carry out our analysis on the whole set of articles. We notice some heterogeneity across articles, with some displaying high power and others lower power. Thus, in a second part, we will look in more detail at articles displaying low power.

We start by computing the average and median power, type M and type S errors.

Proportion of     Power                    Type M                    Type S
the true effect   Mean        Median       Mean       Median         Mean        Median
0.01              0.1043085   0.0503224    55.922046  44.084318      0.3385168   0.4383054
0.05              0.2526895   0.0580982    11.339701   8.886364      0.1895556   0.2243359
0.10              0.3424199   0.0828139     5.831147   4.523520      0.1087653   0.0770272
0.33              0.5488371   0.4172006     2.093454   1.534451      0.0141896   0.0002478
0.50              0.6629577   0.7556944     1.579196   1.156594      0.0054233   0.0000026
0.67              0.7570693   0.9445702     1.345587   1.033339      0.0029559   0.0000000
0.75              0.7937769   0.9782421     1.277935   1.013360      0.0023906   0.0000000
0.90              0.8500446   0.9975569     1.190855   1.001596      0.0017246   0.0000000
1.00              0.8794131   0.9995884     1.151620   1.000280      0.0014349   0.0000000
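A table like the one above can be produced by running the design calculation on every retrieved estimate for each fraction of the measured effect, then averaging. A sketch, reusing retro_design_sketch() and the toy effects table from the snippets above:

```r
library(dplyr)
library(purrr)
library(tidyr)

# Fractions of the measured effect used as hypothetical true effects.
fractions <- c(0.01, 0.05, 0.10, 0.33, 0.50, 0.67, 0.75, 0.90, 1.00)

design_summary <- expand_grid(effects, fraction = fractions) %>%
  mutate(design = map2(fraction * estimate, se, retro_design_sketch)) %>%
  unnest_wider(design) %>%
  group_by(fraction) %>%
  summarise(across(c(power, type_m, type_s),
                   list(mean = mean, median = median)))
```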

Then, we look at the distribution of power, type M and type S errors across simulations and for different sizes of the true effect.

A large share of articles displays high power and low type M and type S errors in each robustness check. However, a non-negligible number of articles display lower power and/or some evidence of type M error. Type S errors do not seem to be an important issue here. We investigate potential causes of the lack of power and of the type M errors in the next subsection.

Note that for type M errors, due to some outliers, we use a log scale. We also plot the distribution without the log scale, restricting our sample to type M errors lower than 2.5 (which covers 95% of our sample, even when the true effect is assumed to be one third of the measured effect).

We find that, even if the measured effect is the true effect, there is some risk of type M error.

Then, we look at how type M and type S errors evolve with power in the estimates considered.

There is a one-to-one relationship between power and type M and type S errors. Not surprisingly, type M and type S errors skyrocket in studies with low power.

We then investigate how power, type M and type S errors vary with the size of the true effect, expressed as a proportion of the measured effect.

Power, type M and type S errors also skyrocket for small values of the true effect (as a proportion of the measured effect). In addition, on average, if for each paper of the literature the true effect was only three quarters of the measured effect, the power would be lower than the usual 80%. Type S errors only seem to be an issue for small values of the true effect as a proportion of the measured effect. Type M errors seem to be more consistently problematic. The shoot-up in the previous graph makes it difficult to read the values of the type M error when the true effect is not a small proportion of the measured effect. We therefore zoom in.

We notice that, on average in the literature, treatment effects are overestimated, even for large values of the true effect. This result might be driven by some outliers. We therefore look into the evolution of the median type M error with the true effect size.

We notice that the issue is much less important when looking at the median. This suggests some heterogeneity in terms of power in the literature.

It might also be interesting to look at how power, type M and type S errors evolve over time, i.e., with publication date.

There does not seem to be a clear trend in the evolution of power and type S error. However, type M error seems to have peaked in the 2010s and to be decreasing again recently.

Analysis of articles with low power

In the previous section, we noticed that a non-negligible number of studies seemed to suffer from a low power issue and associated type M error. We classify as low powered those effects for which power is lower than 80% when the true effect is 3/4 of the measured effect. 80% is the threshold usually used in power analyses; 3/4 is arbitrary and could easily be changed in a robustness check. Following this criterion, the number and proportion of estimates with low power is as follows:

Power            Number of estimates   Proportion
Adequate power   1179                  0.6345533
Low power        679                   0.3654467
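This split can be reproduced from the same design calculations. A sketch, again reusing retro_design_sketch() and the toy effects table:

```r
library(dplyr)
library(purrr)

low_power <- effects %>%
  mutate(
    # Power if the true effect were 3/4 of the measured effect.
    power_3_4   = map2_dbl(0.75 * estimate, se,
                           ~ retro_design_sketch(.x, .y)$power),
    power_class = if_else(power_3_4 < 0.8, "Low power", "Adequate power")
  )

count(low_power, power_class) %>%
  mutate(proportion = n / sum(n))
```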

We investigate the particularities of the articles with low power. We start by reproducing the analyses used to compare articles for which we retrieved an effect and those for which we did not. First, we look into the distribution of publication dates.

It seems that fewer low power articles have been published recently, compared to articles with adequate power. This confirms our previous finding. We then look into the distribution of journals.

Interestingly, some journals, such as “Science of the Total Environment”, the “International Journal of Occupational Medicine and Environmental Health”, the “Cochrane Database of Systematic Reviews”, “Environmental Science and Pollution Research” and the “Journal of Exposure Science and Environmental Epidemiology”, publish a large share of low power studies. On the contrary, BMJ Open publishes very few low power studies.

Here also, grouping the journals into broad thematic groups could be more instructive.

We also look into disparities across pollutants.

There does not seem to be stark differences by pollutant type.


  1. Note that a number of abstracts contain the phrase “CI” without actually displaying effects and confidence intervals.